Description

BACKGROUND OF THE INVENTION 

The present invention relates to a method and apparatus for providing temporal and spatial scaling of video images
including video object planes in a digital video sequence. In particular, a motion compensation scheme is presented 
which is suitable for use with scaled frame mode or field mode video. A scheme for adaptively compressing field mode 
video using a spatial transformation such as the Discrete Cosine Transformation (DCT) is also presented. 

The invention is particularly suitable for use with various multimedia applications, and is compatible with the MPEG-4
Verification Model (VM) 7.0 standard described in document ISO/IEC JTC1/SC29/WG11 N1642, entitled "MPEG-4
Video Verification Model Version 7.0", April 1997, incorporated herein by reference. The invention can further provide 
coding of stereoscopic video, picture-in-picture, preview access channels, and asynchronous transfer mode (ATM) 
communications. 

MPEG-4 is a new coding standard which provides a flexible framework and an open set of coding tools for
communication, access, and manipulation of digital audio-visual data. These tools support a wide range of features. The
flexible framework of MPEG-4 supports various combinations of coding tools and their corresponding functionalities for
applications required by the computer, telecommunication, and entertainment (i.e., TV and film) industries, such as
database browsing, information retrieval, and interactive communications. 

MPEG-4 provides standardized core technologies allowing efficient storage, transmission and manipulation of 
video data in multimedia environments. MPEG-4 achieves efficient compression, object scalability, spatial and temporal
scalability, and error resilience. 

The MPEG-4 video VM coder/decoder (codec) is a block- and object-based hybrid coder with motion compensation.
Texture is encoded with an 8x8 DCT utilizing overlapped block-motion compensation. Object shapes are represented
as alpha maps and encoded using a Content-based Arithmetic Encoding (CAE) algorithm or a modified DCT
coder, both using temporal prediction. The coder can handle sprites as they are known from computer graphics. Other
coding methods, such as wavelet and sprite coding, may also be used for special applications. 

Motion compensated texture coding is a well known approach for video coding. Such an approach can be modeled 
as a three-stage process. The first stage is signal processing which includes motion estimation and compensation 
(ME/MC) and a 2-D spatial transformation. The objective of ME/MC and the spatial transformation is to take advantage
of temporal and spatial correlations in a video sequence to optimize the rate-distortion performance of quantization and
entropy coding under a complexity constraint. The most common technique for ME/MC has been block matching, and 
the most common spatial transformation has been the DCT. However, special concerns arise for ME/MC and DCT cod- 
ing of the boundary blocks of an arbitrarily shaped VOP. 

The MPEG-2 Main Profile is a precursor to the MPEG-4 standard, and is described in document ISO/IEC
JTC1/SC29/WG11 N0702, entitled "Information Technology - Generic Coding of Moving Pictures and Associated
Audio, Recommendation H.262," March 25, 1994, incorporated herein by reference. Scalability extensions to the
MPEG-2 Main Profile have been defined which provide for two or more separate bitstreams, or layers. Each layer can 
be combined to form a single high-quality signal. For example, the base layer may provide a lower quality video signal, 
while the enhancement layer provides additional information that can enhance the base layer image. 
In particular, spatial and temporal scalability can provide compatibility between different video standards or
decoder capabilities. With spatial scalability, the base layer video may have a lower spatial resolution than an input 
video sequence, in which case the enhancement layer carries information which can restore the resolution of the base 
layer to the input sequence level. For instance, an input video sequence which corresponds to the International
Telecommunications Union - Radio Sector (ITU-R) 601 standard (with a resolution of 720x576 pixels) may be carried in a
base layer which corresponds to the Common Interchange Format (CIF) standard (with a resolution of 360x288 pixels).
The enhancement layer in this case carries information which is used by a decoder to restore the base layer video to 
the ITU-R 601 standard. Alternatively, the enhancement layer may have a reduced spatial resolution. 

With temporal scalability, the base layer can have a lower temporal resolution (i.e., frame rate) than the input video 
sequence, while the enhancement layer carries the missing frames. When combined at a decoder, the original frame 
rate is restored.

Accordingly, it would be desirable to provide temporal and spatial scalability functions for coding of video signals 
which include video object planes (VOPs) such as those used in the MPEG-4 standard. It would be desirable to have 
the capability for coding of stereoscopic video, picture-in-picture, preview access channels, and asynchronous transfer
mode (ATM) communications. 

It would further be desirable to have a relatively low complexity and low cost codec design where the size of the
search range is reduced for motion estimation of enhancement layer prediction coding of bi-directionally predicted 
VOPs (B-VOPs). It would also be desirable to efficiently code an interlaced video input signal which is scaled to base 
and enhancement layers by adaptively reordering pixel lines of an enhancement layer VOP prior to determining a
residue and spatially transforming the data. The present invention provides a system having the above and other
advantages.

SUMMARY OF THE INVENTION 

In accordance with the present invention, a method and apparatus are presented for providing temporal and spatial 
scaling of video images such as video object planes (VOPs) in a digital video sequence. The VOPs can comprise a full
frame and/or a subset of the frame, and may be arbitrarily shaped. Additionally, a plurality of VOPs may be provided in 
one frame or otherwise be temporally coincident. 

A method is presented for scaling an input video sequence comprising video object planes (VOPs) for
communication in a corresponding base layer and enhancement layer, where downsampled data is carried in the base layer. The
VOPs in the input video sequence have an associated spatial resolution and temporal resolution (e.g., frame rate).

Pixel data of a first particular one of the VOPs of the input video sequence is downsampled to provide a first base 
layer VOP having a reduced spatial resolution. Pixel data of at least a portion of the first base layer VOP is upsampled 
to provide a first upsampled VOP in the enhancement layer. The first upsampled VOP is differentially encoded using the 
first particular one of the VOPs of the input video sequence, and provided in the enhancement layer at a temporal posi- 
tion corresponding to the first base layer VOP. 

The differential encoding includes the step of determining a residue according to a difference between pixel data of 
the first upsampled VOP and pixel data of the first particular one of the VOPs of the input video sequence. The residue
is spatially transformed to provide transform coefficients, for example, using the DCT. 
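The downsample, upsample and residue steps described above can be sketched as follows. This is only a minimal illustration, assuming 2:1 scaling, a 2x2 block-average downsampling filter and pixel-replication upsampling; the function names and both filters are hypothetical choices not taken from the text, and the spatial transform (e.g., DCT) of the residue is omitted.

```python
def downsample_2to1(frame):
    """Average each 2x2 block (assumed filter; the text does not fix one)."""
    return [
        [
            (frame[y][x] + frame[y][x + 1] + frame[y + 1][x] + frame[y + 1][x + 1]) // 4
            for x in range(0, len(frame[0]), 2)
        ]
        for y in range(0, len(frame), 2)
    ]


def upsample_2to1(frame):
    """Replicate each pixel into a 2x2 block (assumed filter)."""
    out = []
    for row in frame:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def residue(input_vop, upsampled_vop):
    """Per-pixel difference (input minus upsampled) that would be transformed and coded."""
    return [
        [a - b for a, b in zip(in_row, up_row)]
        for in_row, up_row in zip(input_vop, upsampled_vop)
    ]


# Hypothetical 4x4 luminance block standing in for an input VOP.
input_vop = [
    [10, 12, 20, 22],
    [14, 16, 24, 26],
    [30, 32, 40, 42],
    [34, 36, 44, 46],
]
base_vop = downsample_2to1(input_vop)                       # carried in the base layer
enh_residue = residue(input_vop, upsample_2to1(base_vop))   # carried in the enhancement layer
```

A decoder that applies the same upsampling to the base layer VOP and adds the residue recovers the input pixel values.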

When the VOPs in the input video sequence are field mode VOPs, the differential encoding involves reordering the 
lines of the pixel data of the first upsampled VOP in a field mode prior to determining the residue if the lines of pixel data 
meet a reordering criterion. The criterion is whether a sum of differences of luminance values of opposite-field lines (e.g.,
odd to even, and even to odd) is greater than a sum of differences of luminance data of same-field lines (e.g., odd to 
odd, and even to even) and a bias term. 
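The reordering test can be sketched as below. This is an illustration under stated assumptions: differences are taken as absolute differences, opposite-field lines are adjacent lines (i, i+1), same-field lines are lines two apart (i, i+2), and the function names and default bias are hypothetical.

```python
def should_reorder(lines, bias=0):
    """True when opposite-field differences exceed same-field differences plus a bias."""
    opposite = sum(
        abs(a - b)
        for i in range(len(lines) - 1)
        for a, b in zip(lines[i], lines[i + 1])
    )
    same = sum(
        abs(a - b)
        for i in range(len(lines) - 2)
        for a, b in zip(lines[i], lines[i + 2])
    )
    return opposite > same + bias


def reorder_to_field_mode(lines):
    """Group even-indexed lines first, then odd-indexed lines."""
    return lines[0::2] + lines[1::2]


# Hypothetical luminance lines: the two fields differ strongly (e.g., motion between fields).
interlaced = [[10, 10], [50, 50], [10, 10], [50, 50]]
# Hypothetical smooth vertical gradient: adjacent lines are already well correlated.
smooth = [[10, 10], [11, 11], [12, 12], [13, 13]]

if should_reorder(interlaced):
    interlaced_fields = reorder_to_field_mode(interlaced)
```

With these inputs the first block is reordered so that lines of the same field become adjacent, while the smooth block is left in frame order.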

The upsampled pixel data of the first base layer VOP may be a subset of the entire first base layer VOP, such that 
a remaining portion of the first base layer VOP which is not upsampled has a lower spatial resolution than the upsam- 
pled pixel data. 

A second base layer VOP and upsampled VOP in the enhancement layer may be provided in a similar manner. One 
or both of the first and second base layer VOPs can be used to predict an intermediate VOP which corresponds to the 
first and second upsampled VOPs. The intermediate VOP is encoded for communication in the enhancement layer tem- 
porally between the first and second upsampled VOPs. 

Furthermore, the enhancement layer may have a higher temporal resolution than the base layer when there is no 
intermediate base layer VOP between the first and second base layer VOPs. 

In a specific application, the base and enhancement layer provide a picture-in-picture (PIP) capability where a PIP 
image is carried in the base layer, or a preview access channel capability, where a preview access image is carried in 
the base layer. In such applications, it is acceptable for the PIP image or free preview image to have a reduced spatial 
and/or temporal resolution. In an ATM application, higher priority, lower bit rate data may be provided in the base layer, 
while lower priority, higher bit rate data is provided in the enhancement layer. In this case, the base layer is allocated a 
guaranteed bandwidth, but the enhancement layer data may occasionally be lost. 

A method is presented for scaling an input video sequence comprising video object planes (VOPs) where down- 
sampled data is carried in the enhancement layer rather than the base layer. With this method, a first particular one of 
the VOPs of the input video sequence is provided in the base layer as a first base layer VOP, e.g., without changing the
spatial resolution. Pixel data of at least a portion of the first base layer VOP is downsampled to provide a corresponding 
first downsampled VOP in the enhancement layer at a temporal position corresponding to the first base layer VOP. Cor- 
responding pixel data of the first particular one of the VOPs is downsampled to provide a comparison VOP, and the first 
downsampled VOP is differentially encoded using the comparison VOP.

The base and enhancement layers may provide a stereoscopic video capability in which image data in the
enhancement layer has a lower spatial resolution than image data in the base layer. 

A method for coding a bi-directionally predicted video object plane (B-VOP) is also presented. First and second 
base layer VOPs are provided in the base layer which correspond to the input video sequence VOPs. The second base 
layer VOP is a P-VOP which is predicted from the first base layer VOP according to a motion vector MV_p. The B-VOP
is provided in the enhancement layer temporally between the first and second base layer VOPs. 

The B-VOP is encoded using at least one of a forward motion vector MV_f and a backward motion vector MV_b which
are obtained by scaling the motion vector MV_p. This efficient coding technique avoids the need to perform an
independent exhaustive search in the reference VOPs. A temporal distance TR_p separates the first and second base layer VOPs,
while a temporal distance TR_B separates the first base layer VOP and the B-VOP.

A ratio m/n is defined as the ratio of the spatial resolution of the first and second base layer VOPs to the spatial
resolution of the B-VOP. That is, either the base layer VOPs or the B-VOP in the enhancement layer may be
downsampled relative to the VOPs of the input video sequence by a ratio m/n. It is assumed that either the base or enhancement
layer VOP has the same spatial resolution as the input video sequence. The forward motion vector MV_f is determined
according to the relationship MV_f = (m/n) · TR_B · MV_p / TR_p, while the backward motion vector MV_b is determined
according to the relationship MV_b = (m/n) · (TR_B - TR_p) · MV_p / TR_p. The ratio m/n is any positive number, including
fractional values.
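As a worked illustration of the two relationships, the sketch below scales a base layer motion vector into forward and backward vectors. The function name, the tuple representation of a vector and the sample values are assumptions; any rounding to the pel accuracy used by a real coder is left out.

```python
from fractions import Fraction


def scale_motion_vectors(mv_p, tr_p, tr_b, m=1, n=1):
    """Return (MV_f, MV_b) derived from MV_p.

    MV_f = (m/n) * TR_B * MV_p / TR_p
    MV_b = (m/n) * (TR_B - TR_p) * MV_p / TR_p
    """
    ratio = Fraction(m, n)
    mv_f = tuple(ratio * tr_b * c / tr_p for c in mv_p)
    mv_b = tuple(ratio * (tr_b - tr_p) * c / tr_p for c in mv_p)
    return mv_f, mv_b


# Hypothetical values: the P-VOP is 3 frame periods after the first base layer VOP,
# and the B-VOP is 1 frame period after it.
mv_f, mv_b = scale_motion_vectors(mv_p=(6, -3), tr_p=3, tr_b=1)

# Same vectors with a spatial resolution ratio m/n = 1/2.
mv_f_half, mv_b_half = scale_motion_vectors(mv_p=(6, -3), tr_p=3, tr_b=1, m=1, n=2)
```

Here MV_f is (2, -1) and MV_b is (-4, 2): the forward vector covers one third of MV_p, and the backward vector covers the remaining two thirds in the opposite direction.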

The B-VOP is encoded using a search region of the first base layer VOP whose center is determined according to 
the forward motion vector MV_f, and a search region of the second base layer VOP whose center is determined
according to the backward motion vector MV_b.
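One way to picture the reduced search region is a small window centered on the position pointed to by the scaled vector. The sketch below is an assumption-laden illustration: the function name, the square window, the sign convention of the vector and the clipping to the picture boundary are all hypothetical.

```python
def search_window(block_x, block_y, mv, search_range, width, height):
    """Return (x0, y0, x1, y1) of a window centered at the block position displaced by mv."""
    cx = block_x + round(mv[0])
    cy = block_y + round(mv[1])
    x0 = max(0, cx - search_range)
    y0 = max(0, cy - search_range)
    x1 = min(width - 1, cx + search_range)
    y1 = min(height - 1, cy + search_range)
    return (x0, y0, x1, y1)


# Hypothetical block at (16, 16) in a 64x48 reference VOP, searched within +/-4 pixels.
forward_window = search_window(16, 16, (2, -1), 4, 64, 48)   # in the first base layer VOP
backward_window = search_window(16, 16, (-4, 2), 4, 64, 48)  # in the second base layer VOP
```

Because the window is centered on the predicted displacement, a small search range can be used instead of an exhaustive search around the block position.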

Corresponding decoder methods and apparatus are also presented.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIGURE 1 is an illustration of a video object plane (VOP) coding and decoding process in accordance with the 
present invention. 

FIGURE 2 is a block diagram of a VOP coder and decoder in accordance with the present invention. 
FIGURE 3 is an illustration of pixel upsampling in accordance with the present invention.
FIGURE 4 is an illustration of an example of the prediction process between VOPs in a base layer and enhance- 
ment layer. 

FIGURE 5 is an illustration of spatial and temporal scaling of a VOP in accordance with the present invention. 
FIGURE 6 illustrates the reordering of pixel lines from frame to field mode in accordance with the present invention. 
FIGURE 7 is an illustration of a picture-in-picture (PIP) or preview channel access application with spatial and tem- 
poral scaling in accordance with the present invention. 

FIGURE 8 is an illustration of a stereoscopic video application in accordance with the present invention. 

DETAILED DESCRIPTION OF THE INVENTION 

A method and apparatus are presented for providing temporal and spatial scaling of video images including video 
object planes (VOPs) in a digital video sequence. 

FIGURE 1 is an illustration of a video object coding and decoding process in accordance with the present invention.
Frame 105 includes three pictorial elements, including a square foreground element 107, an oblong foreground element
108, and a landscape backdrop element 109. In frame 115, the elements are designated VOPs using a segmentation
mask such that VOP 117 represents the square foreground element 107, VOP 118 represents the oblong foreground
element 108, and VOP 119 represents the landscape backdrop element 109. A VOP can have an arbitrary shape, and
a succession of VOPs is known as a video object. A full rectangular video frame may also be considered to be a VOP.
Thus, the term "VOP" will be used herein to indicate both arbitrary and non-arbitrary image area shapes. A segmentation
mask is obtained using known techniques, and has a format similar to that of ITU-R 601 luminance data. Each pixel
is identified as belonging to a certain region in the video frame. 
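The role of the segmentation mask can be sketched as follows. This is a hypothetical illustration: the mask is assumed to be a per-pixel array of region labels with the same dimensions as the frame, the labels 117, 118 and 119 are borrowed from the figure numerals purely for readability, and the dictionary representation of a VOP is an assumption.

```python
def split_into_vops(frame, mask):
    """Group pixels by mask label; each VOP maps (x, y) to a pixel value."""
    vops = {}
    for y, (pixel_row, label_row) in enumerate(zip(frame, mask)):
        for x, (pixel, label) in enumerate(zip(pixel_row, label_row)):
            vops.setdefault(label, {})[(x, y)] = pixel
    return vops


# Hypothetical 3x2 frame and its segmentation mask.
frame = [
    [1, 2, 3],
    [4, 5, 6],
]
mask = [
    [119, 117, 117],
    [119, 119, 118],
]
vops = split_into_vops(frame, mask)
```

Each entry of the result holds the pixels of one arbitrarily shaped region, which could then be passed to its own shape, motion and texture encoder.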

The frame 105 and VOP data from frame 115 are supplied to separate encoding functions. In particular, VOPs 117,
118 and 119 undergo shape, motion and texture encoding at encoders 137, 138 and 139, respectively. With shape cod- 
ing, binary and gray scale shape information is encoded. With motion coding, the shape information is coded using 
motion estimation within a frame. With texture coding, a spatial transformation such as the DCT is performed to obtain 
transform coefficients which can be variable-length coded for compression. 

The coded VOP data is then combined at a multiplexer (MUX) 140 for transmission over a channel 145. Alterna- 
tively, the data may be stored on a recording medium. The received coded VOP data is separated by a demultiplexer 
(DEMUX) 150 so that the separate VOPs 117-119 are decoded and recovered. Frames 155, 165 and 175 show that
VOPs 117, 118 and 119, respectively, have been decoded and recovered and can therefore be individually manipulated
using a compositor 160 which interfaces with a video library 170, for example.

The compositor may be a device such as a personal computer which is located at a user's home to allow the user 
to edit the received data to provide a customized image. For example, the user's personal video library 170 may include
a previously stored VOP 178 (e.g., a circle) which is different than the received VOPs. The user may compose a frame
185 where the circular VOP 178 replaces the square VOP 117. The frame 185 thus includes the received VOPs 118
and 119 and the locally stored VOP 178.
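The composition step can be sketched as painting individually decoded VOPs into an output frame. The representation of a VOP as a mapping from (x, y) to a pixel value, the function name and the draw order are assumptions made for illustration only.

```python
def compose(width, height, vops, background=0):
    """Paint VOPs into a new frame; later VOPs are drawn over earlier ones."""
    frame = [[background] * width for _ in range(height)]
    for vop in vops:
        for (x, y), pixel in vop.items():
            frame[y][x] = pixel
    return frame


# Hypothetical 3x2 example: a received backdrop, a received oblong element,
# and a locally stored element used in place of the received square.
backdrop = {(x, y): 9 for y in range(2) for x in range(3)}
oblong = {(2, 1): 6}
stored_circle = {(1, 0): 7}

composed = compose(3, 2, [backdrop, oblong, stored_circle])
```

Swapping one entry of the list for a VOP taken from a local library yields a customized frame without re-decoding the other VOPs.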

In another example, the background VOP 109 may be replaced by a background of the user's choosing. For example,
when viewing a television news broadcast, the announcer may be coded as a VOP which is separate from the
background, such as a news studio. The user may select a background from the library 170 or from another television
program, such as a channel with stock price or weather information. The user can therefore act as a video editor. 

The video library 170 may also store VOPs which are received via the channel 145, and may access VOPs and
other image elements via a network such as the Internet.

It should be appreciated that the frame 105 may include regions which are not VOPs and therefore cannot be
individually manipulated. Furthermore, the frame 105 need not have any VOPs. Generally, a video session comprises a
single VOP, or a sequence of VOPs.

The video object coding and decoding process of FIGURE 1 enables many entertainment, business and educational
applications, including personal computer games, virtual environments, graphical user interfaces,
videoconferencing, Internet applications and the like. In particular, the capability for spatial and temporal scaling of the VOPs in
accordance with the present invention provides even greater capabilities. 

FIGURE 2 is a block diagram of a video object coder and decoder in accordance with the present invention. The
encoder 201, which corresponds to elements 137-139 shown schematically in FIGURE 1, includes a scalability
preprocessor 205 which receives an input video data sequence "in". To achieve spatial scalability with the base layer
having a lower spatial resolution than the enhancement layer, "in" is spatially downsampled to obtain the signal "in_0",
which is, in turn, provided to a base layer encoder 220 via a path 217. "in_0" is encoded at the base layer encoder 220,
and the encoded data is provided to a multiplexer (MUX) 230. An MPEG-4 System and Description Language (MSDL)
MUX may be used.

Note that, when the input video sequence "in" is in field (interlaced) mode, the downsampled signal "in_0" will be
in frame (progressive) mode since downsampling does not preserve the pixel data in even and odd fields. Of course,
"in_0" will also be in frame mode when "in" is in frame mode.

The reconstructed image data is provided from the base layer encoder 220 to a midprocessor 215 via a path 218.
The midprocessor may perform pixel upsampling, as discussed below in greater detail in connection with FIGURE 3. The
upsampled image data, which is in frame mode, is then provided to an enhancement layer encoder 210 via a path 212, where
it is differentially encoded using the input image data "in_1" provided from the preprocessor 205 to the encoder 210 via
a path 207. In particular, the upsampled pixel data (e.g., luminance data) is subtracted from the input image data to
obtain a residue, which is then encoded using the DCT or other spatial transformation.

In accordance with the present invention, when the input video sequence is in field mode, coding efficiency can be
improved by grouping the pixel lines of the upsampled enhancement layer image which correspond to the original even 
(top) and odd (bottom) field of the input video sequence. This can decrease the magnitude of the residue in some cases 
since pixel data within a field will often have a greater correlation with other pixel data in the same field than with the 
data in the opposite field. Thus, by reducing the magnitude of the residue, fewer bits are required to code the image
data. Refer to FIGURE 6 and the associated discussion, below, for further details.

The encoded residue of the upsampled image in the enhancement layer is provided to the MUX 230 for transmission
with the base layer data over a communication channel 245. The data may alternatively be stored locally. Note that
the MUX 230, channel 245, and DEMUX 250 correspond, respectively, to elements 140, 145 and 150 in FIGURE 1.

Note that the image data which is provided to the midprocessor 215 from the base layer encoder 220 may be the
entire video image, such as a full-frame VOP, or a VOP which is a subset of the entire image. Moreover, a plurality of
VOPs may be provided to the midprocessor 215. MPEG-4 currently supports up to 256 VOPs.

At a decoder 299, the encoded data is received at a demultiplexer (DEMUX) 250, such as an MPEG-4 MSDL
DEMUX. The enhancement layer data, which has a higher spatial resolution than the base layer data in the present
example, is provided to an enhancement layer decoder 260. The base layer data is provided to a base layer decoder
270, where the signal "out_0" is recovered and provided to a midprocessor 265 via a path 267, and to a scalability
postprocessor 280 via a path 277. The midprocessor operates in a similar manner to the midprocessor 215 on the encoder
side by upsampling the base layer data to recover a full-resolution image. This image is provided to the enhancement
layer decoder 260 via a path 262 for use in recovering the enhancement layer data signal "out_1", which is then
provided to the scalability postprocessor 280 via path 272. The scalability postprocessor 280 performs operations such as
spatial upsampling of the decoded base layer data for display as signal "outp_0", while the enhancement layer data is
output for display as signal "outp_1".

When the encoder 201 is used for temporal scalability, the preprocessor 205 performs temporal demultiplexing
(e.g., pulldown processing or frame dropping) to reduce the frame rate, e.g., for the base layer. For example, to
decrease the frame rate from 30 frames/sec. to 15 frames/sec., every other frame is dropped.
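Frame dropping can be sketched as a simple split of the input sequence. The function names are hypothetical, and the sketch assumes the dropped frames are kept aside so that they can be interleaved again to restore the original frame rate.

```python
def temporal_demultiplex(frames, keep_every=2):
    """Keep every keep_every-th frame for the base layer; return (kept, dropped)."""
    kept = frames[::keep_every]
    dropped = [f for i, f in enumerate(frames) if i % keep_every != 0]
    return kept, dropped


def temporal_multiplex(kept, dropped, keep_every=2):
    """Interleave kept and dropped frames back into presentation order."""
    kept_iter, dropped_iter = iter(kept), iter(dropped)
    total = len(kept) + len(dropped)
    return [
        next(kept_iter) if i % keep_every == 0 else next(dropped_iter)
        for i in range(total)
    ]


# Frame indices stand in for pictures: six frames of a 30 frames/sec. input.
frames = list(range(6))
base_frames, other_frames = temporal_demultiplex(frames)   # base layer at 15 frames/sec.
restored_frames = temporal_multiplex(base_frames, other_frames)
```

With keep_every=2 the base layer carries frames 0, 2 and 4, which halves the frame rate as in the 30 to 15 frames/sec. example.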

Table 1 below shows twenty-four possible configurations of the midprocessors 215 and 265, scalability
preprocessor 205 and scalability postprocessor 280.






EP0 883 300 A2 



Table 1

Configuration | Layer       | Temporal Resolution | Spatial Resolution | Scalability Preprocessor                     | Midprocessor       | Scalability Postprocessor
1             | Base        | [illegible]         | Low                | Downsample Filtering                         | Upsample Filtering | N/C
1             | Enhancement | [illegible]         | High               | N/C                                          | N/A                | N/C
2             | Base        | Low                 | Low                | Downsample Filtering and Pulldown Processing | Upsample Filtering | N/C
2             | Enhancement | High                | High               | N/C                                          | N/A                | N/C
3             | Base        | High                | Low                | Downsample Filtering                         | Upsample Filtering | N/C
3             | Enhancement | Low                 | High               | Pulldown Processing                          | N/A                | N/C

[The remaining rows of Table 1 are not recoverable from the source.]





In Table 1, the first column indicates the configuration number, the second column indicates the layer, and the third
column indicates the temporal resolution of the layer (e.g., either high or low). When "Low(High)" is listed, the temporal
resolution of the base and enhancement layers is either both high or both low. The fourth column indicates the spatial
resolution. The fifth, sixth and seventh columns indicate the corresponding action of the scalability preprocessor 205,
midprocessor 215 and 265, and scalability postprocessor 280. "N/C" denotes no change in temporal or spatial
resolution, i.e., normal processing is performed. "N/A" means "not applicable." The midprocessor 215, 265 actions do not
affect the enhancement layer.

Spatially scaled coding is illustrated using configuration 1 as an example. As discussed, when the scaleable coder
201 is used to code a VOP, the preprocessor 205 produces two substreams of VOPs with different spatial resolutions.
As shown in Table 1, the base layer has a low spatial resolution, and the enhancement layer has a high spatial
resolution which corresponds to the resolution of the input sequence. Therefore, the base-layer sequence "in_0" is generated
by a downsampling process of the input video sequence "in" at the scalability preprocessor 205. The enhancement
layer sequence is generated by upsample filtering of the downsampled base layer sequence at the midprocessors 215,
265 to achieve the same high spatial resolution of "in". The postprocessor 280 performs normal processing, i.e., it does
not change the temporal or spatial resolution of "out_1" or "out_0".

For example, a base layer CIF resolution sequence (360x288 pixels) can be generated from a 2:1 downsample fil- 
tering of an ITU-R 601 resolution input sequence (720x576 pixels). Downsampling by any integral or non-integral ratio 
may be used. 
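The resolution arithmetic for integral and non-integral ratios can be illustrated as below. This is a sketch only: it uses nearest-sample selection as a stand-in for the unspecified downsample filter, and the function names are hypothetical.

```python
from fractions import Fraction


def scaled_size(width, height, ratio):
    """Picture size after downsampling by ratio (e.g., 2 for 2:1)."""
    r = Fraction(ratio)
    return int(width / r), int(height / r)


def resample(frame, ratio):
    """Pick the nearest source sample for each output position (no filtering)."""
    r = Fraction(ratio)
    out_w, out_h = scaled_size(len(frame[0]), len(frame), r)
    return [
        [frame[int(y * r)][int(x * r)] for x in range(out_w)]
        for y in range(out_h)
    ]


cif_size = scaled_size(720, 576, 2)                  # ITU-R 601 to CIF, 2:1
other_size = scaled_size(720, 576, Fraction(3, 2))   # a non-integral ratio, 3:2

# Hypothetical 6x6 picture whose pixel value encodes its position (10*y + x).
picture = [[10 * y + x for x in range(6)] for y in range(6)]
smaller = resample(picture, Fraction(3, 2))
```

A practical coder would low-pass filter before decimating to limit aliasing; the sketch only shows how the output dimensions follow from the ratio.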

Temporally and spatially scaled coding is illustrated using configuration 2 as an example. Here, the input video 
sequence "in", which has a high spatial and temporal resolution, is converted to a base layer sequence having a low
spatial and temporal resolution, and an enhancement layer sequence having a high spatial and temporal resolution.
This is accomplished as indicated by Table 1 by performing downsample filtering and pulldown processing at the
preprocessor 205 to provide the signal "in_0", with upsample filtering at the midprocessors 215, 265 and normal
processing at the postprocessor 280.

With configuration 3, the input video sequence "in", which has a low or high temporal resolution, and a high spatial 
resolution, is converted to a base layer sequence having a corresponding low or high temporal resolution, and a high 
spatial resolution, and an enhancement layer sequence having a corresponding low or high temporal resolution, and a
low spatial resolution. This is accomplished by performing downsample filtering for the enhancement layer sequence
"in_1" at the preprocessor 205, with downsample filtering at the midprocessors 215, 265, and upsample filtering for the
enhancement layer sequence "out_1" at the postprocessor 280.

The remaining configurations can be understood in view of the foregoing examples. 

FIGURE 3 is an illustration of pixel upsampling in accordance with the present invention. Upsampling filtering may 
be performed by the midprocessors 215, 265 with configuration 1 of Table 1. For example, a VOP having a CIF resolu- 
tion (360x288 pixels) can be converted to an ITU-R 601 resolution (720x576 pixels) with 2:1 upsampling. Pixels 310, 
320, 330 and 340 of the CIF image are sampled to produce pixels 355, 360, 365, 370, 375, 380, 385 and 390 of the 
ITU-R 601 image. In particular, an ITU-R 601 pixel 360 is produced by sampling CIF pixels 310 and 320 as shown by 
arrows 312 and 322, respectively. Similarly, an ITU-R 601 pixel 365 is also produced by sampling CIF pixels 310 and 
320, as shown by arrows 314 and 324, respectively. 
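The idea that two output pixels are each formed from the same pair of neighboring input pixels can be sketched in one dimension. The weights (3/4, 1/4) and (1/4, 3/4), the rounding and the edge handling are assumptions chosen for illustration; the text does not specify the filter taps.

```python
def upsample_row_2to1(row):
    """Produce two output pixels per input pixel from each pixel and its right neighbor."""
    out = []
    for i, left in enumerate(row):
        right = row[min(i + 1, len(row) - 1)]      # repeat the last pixel at the edge
        out.append((3 * left + right + 2) // 4)    # output pixel nearer the left input
        out.append((left + 3 * right + 2) // 4)    # output pixel nearer the right input
    return out


# Two hypothetical input pixel values standing in for neighboring CIF pixels.
upsampled_row = upsample_row_2to1([100, 200])
```

Applying the same step to every row and then to every column doubles both dimensions, e.g., from 360x288 to 720x576.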

FIGURE 4 is an illustration of an example of the prediction process between VOPs in the base layer and
enhancement layer. In the enhancement layer encoder 210 of FIGURE 2, a VOP of the enhancement layer is encoded as either
a P-VOP or B-VOP. In this example, VOPs in the enhancement layer have a greater spatial resolution than base layer
VOPs and are therefore drawn with a larger area. The temporal resolution (e.g., frame rate) is the same for both layers.
The VOPs are shown in presentation order from left to right. 

The base layer includes an I-VOP 405, B-VOPs 415 and 420, and a P-VOP 430. The enhancement layer includes
P-VOPs 450 and 490, and B-VOPs 460 and 480. B-VOP 415 is predicted from other base layer VOPs as shown by
arrows 410 and 440, while B-VOP 420 is also predicted from the other base layer VOPs as shown by arrows 425 and
435. P-VOP 430 is predicted from I-VOP 405 as shown by arrow 445. P-VOP 450 is derived by upsampling a base layer
VOP indicated by arrow 455, while P-VOP 490 is derived from upsampling a base layer VOP indicated by arrow 495.
B-VOP 460 is predicted from base layer VOPs as shown by arrows 465 and 475, and B-VOP 480 is predicted from base
layer VOPs as shown by arrows 470 and 485. 

Generally, the enhancement layer VOP which is temporally coincident (e.g., in display or presentation order) with 
an I-VOP in the base layer is encoded as a P-VOP. For example, VOP 450 is temporally coincident with I-VOP 405, and
is therefore coded as a P-VOP. The enhancement layer VOP which is temporally coincident with a P-VOP in the base
layer is encoded as either a P- or B-VOP. For example, VOP 490 is temporally coincident with P-VOP 430 and is coded
as a P-VOP. The enhancement layer VOP which is temporally coincident with a B-VOP in the base layer is encoded as
a B-VOP. For example, see B-VOPs 460 and 480.
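The coding-type rule in this paragraph can be written as a small lookup. The table and function names are hypothetical; the mapping itself restates the rule given above.

```python
# Allowed enhancement layer VOP types for each temporally coincident base layer VOP type.
ALLOWED_ENHANCEMENT_TYPES = {
    "I": ("P",),
    "P": ("P", "B"),
    "B": ("B",),
}


def enhancement_type_allowed(base_type, enhancement_type):
    """True if the enhancement layer type is permitted for the coincident base layer type."""
    return enhancement_type in ALLOWED_ENHANCEMENT_TYPES[base_type]


# Pairs from the example: P-VOP 450 with I-VOP 405, and P-VOP 490 with P-VOP 430.
check_450 = enhancement_type_allowed("I", "P")
check_490 = enhancement_type_allowed("P", "P")
```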

I-VOP 405 and P-VOP 430 are known as anchor VOPs since they are used as prediction references for the
enhancement layer VOPs. I-VOP 405 and P-VOP 430 are therefore coded before the encoding of the corresponding
predicted VOPs in the enhancement layer. The prediction reference of a P-VOP in the enhancement layer is specified
by the forward (prediction) temporal reference indicator forward_temporal_ref in an MPEG-4 compatible syntax. Such
an indicator is a non-negative integer which points to the temporally coincident I-VOP in the base layer. The prediction
references of B-VOPs in the enhancement layer are specified by ref_select_code, forward_temporal_ref and
backward_temporal_ref. See Table 2, below. Note that the table is different for MPEG-2 and MPEG-4 VM 3.0 scalability
schemes.



Table 2

ref_select_code | forward temporal reference VOP | backward temporal reference VOP
00              | base layer                     | base layer
01              | base layer                     | enhancement layer
10              | enhancement layer              | base layer
11              | enhancement layer              | enhancement layer

Table 2 shows the prediction reference choices for B-VOPs in the enhancement layer. For example, assume that
the temporal reference codes temporal_ref for I-VOP 405 and P-VOP 430 in the base layer are 0 and 3, respectively.
Also, let the temporal_ref for P-VOP 450 in the enhancement layer be 0. Then, in FIGURE 4, forward_temporal_ref=0
for P-VOP 450. The prediction references of B-VOPs 460 and 480, given by arrows 465 and 475, and 470 and 485,
respectively, are specified by ref_select_code=00, forward_temporal_ref=0, and backward_temporal_ref=3. The prediction
references of P-VOP 490 are specified by ref_select_code=10, forward_temporal_ref=0 and
backward_temporal_ref=3.
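
The choices in Table 2 amount to a four-entry lookup keyed by the 2-bit code. The Python sketch below is a minimal
illustration, assuming the code is handled as a two-character bit string; the names REF_SELECT and reference_layers
are hypothetical.

```python
# Hypothetical lookup mirroring Table 2:
# ref_select_code -> (layer of forward reference VOP, layer of backward reference VOP)
REF_SELECT = {
    "00": ("base", "base"),
    "01": ("base", "enhancement"),
    "10": ("enhancement", "base"),
    "11": ("enhancement", "enhancement"),
}


def reference_layers(ref_select_code):
    """Return the layers holding the forward and backward reference VOPs."""
    if ref_select_code not in REF_SELECT:
        raise ValueError(f"invalid ref_select_code: {ref_select_code!r}")
    return REF_SELECT[ref_select_code]
```

With ref_select_code=00, as in the example of B-VOPs 460 and 480, both references are taken from the base layer.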

In coding both the base and enhancement layers, the prediction mode is indicated by a 2-bit word
VOP_prediction_type given by Table 3, below.

Table 3

VOP_prediction_type    Code
I                      00
P                      01
B                      10



An "I" prediction type indicates an intra-coded VOP, a "P" prediction type indicates a predicted VOP, and a "B" prediction
type indicates a bi-directionally predicted VOP. The encoding process for the sequence "in_0" of the base layer is the
same as a non-scaleable encoding process, e.g., according to the MPEG-2 Main profile or H.263 standard.

FIGURE 6 illustrates the reordering, or permutation, of pixel lines from frame to field mode in accordance with the
present invention. As mentioned, when an input VOP is in field mode and is downsampled, the resulting VOP will be in
frame mode. Accordingly, when the downsampled image is spatially upsampled, the resulting VOP will also be in frame
mode. However, when the upsampled VOP is differentially encoded by subtracting the input VOP from the upsampled VOP,
the resulting residue may not yield an optimal coding efficiency when a spatial transformation such as the DCT is
subsequently performed on the residue. That is, in many cases, the magnitude of the residue values can be reduced by
permuting (i.e., reordering) the lines of the upsampled image to group the even and odd lines since there may be a
greater correlation between same-field pixels than opposite-field pixels.

An image which may represent upsampled pixel (e.g., luminance) data in an enhancement layer is shown generally
at 600. For example, assume the image 600 is a 16x16 macroblock which is derived by 2:1 upsampling of an 8x8 block.
The macroblock includes even-numbered lines 602, 604, 606, 608, 610, 612, 614 and 616, and odd-numbered lines
603, 605, 607, 609, 611, 613, 615 and 617. The even and odd lines form top and bottom fields, respectively. The
macroblock 600 includes four 8x8 luminance blocks, including a first block defined by the intersection of region 620 and lines
602-609, a second block defined by the intersection of region 625 and lines 602-609, a third block defined by the
intersection of region 620 and lines 610-617, and a fourth block defined by the intersection of region 625 and lines 610-617.

When the pixel lines in image 600 are permuted to form same-field luminance blocks in accordance with the
present invention prior to determining the residue and performing the DCT, the macroblock shown generally at 650 is
formed. Arrows, shown generally at 645, indicate the reordering of the lines 602-617. For example, the even line 602,
which is the first line of macroblock 600, is also the first line of macroblock 650. The even line 604 is made the second
line in macroblock 650. Similarly, the even lines 606, 608, 610, 612, 614 and 616 are made the third through eighth
lines, respectively, of macroblock 650. Thus, a 16x8 luminance region 680 with even-numbered lines is formed. A first
8x8 block is defined by the intersection of regions 680 and 670, while a second 8x8 block is defined by the intersection
of regions 680 and 675.

Similarly, the odd-numbered lines are moved to a 16x8 region 685. The region 685 comprises a first 8x8 block
defined by the intersection of regions 685 and 670, while a second 8x8 block is defined by the intersection of regions 685
and 675. Region 685 thus includes odd lines 603, 605, 607, 609, 611, 613, 615 and 617.

The DCT which is performed on the residue is referred to herein as either "field DCT" or "frame DCT" or the like
according to whether or not the macroblock 600 is reordered as shown at macroblock 650. However, it should be
appreciated that the invention may be adapted for use with other spatial transformations. When field DCT is used, the
luminance lines (or luminance error) in the spatial domain of the macroblock are permuted from a frame DCT orientation to
the top (even) and bottom (odd) field DCT configuration. The resulting macroblocks are transformed, quantized and
variable length encoded normally. When a field DCT macroblock is decoded, the inverse permutation is performed after all
luminance blocks have been obtained from the inverse DCT (IDCT). The 4:2:0 chrominance data is not affected by this
mode.
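
The line permutation between frame and field order, and the inverse permutation applied at the decoder after the
IDCT, can be sketched as follows. This Python fragment is an illustration only: a macroblock is assumed to be a list
with an even number of pixel lines, and the function names frame_to_field and field_to_frame are hypothetical.

```python
def frame_to_field(lines):
    """Group the even-indexed (top field) lines first, then the odd-indexed (bottom field) lines."""
    if len(lines) % 2:
        raise ValueError("expected an even number of lines")
    return lines[0::2] + lines[1::2]


def field_to_frame(lines):
    """Inverse permutation: re-interleave the top field and bottom field lines."""
    if len(lines) % 2:
        raise ValueError("expected an even number of lines")
    half = len(lines) // 2
    frame = []
    for top, bottom in zip(lines[:half], lines[half:]):
        frame.extend([top, bottom])
    return frame


# Lines are labelled with the reference numerals 602-617 of macroblock 600.
macroblock_600 = list(range(602, 618))
macroblock_650 = frame_to_field(macroblock_600)
```

Here the first eight entries of macroblock_650 are the even lines 602 through 616 (region 680) and the last eight are
the odd lines 603 through 617 (region 685); applying field_to_frame restores the original order.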

The criterion for selecting field or frame mode DCT in accordance with the present invention is as follows. Field DCT
should be selected when:

$$\sum_{i=0}^{6}\sum_{j=0}^{15}\left(\left|p_{2i,j}-p_{2i+1,j}\right|+\left|p_{2i+1,j}-p_{2i+2,j}\right|\right) > \sum_{i=0}^{6}\sum_{j=0}^{15}\left(\left|p_{2i,j}-p_{2i+2,j}\right|+\left|p_{2i+1,j}-p_{2i+3,j}\right|\right)+\text{bias}$$

where p_{i,j} is the spatial luminance difference (e.g., residue) data just before the DCT is performed on each of the 8x8
luminance blocks. Advantageously, the equation uses only first-order differences and therefore allows a simpler and
less expensive implementation. The term "bias" is a factor which accounts for nonlinear effects which are not
considered. For example, bias=64 may be used. If the above relationship does not hold, frame DCT is used.

Note that, in the left hand side of the above equation, the error terms refer to opposite-field pixel differences (e.g.,
even to odd, and odd to even). Thus, the left hand side is a sum of differences of luminance values of opposite-field
lines. On the right hand side, the error terms refer to same-field pixel differences (e.g., even to even, and odd to
odd). Thus, the right hand side is a sum of differences of luminance data of same-field lines and a bias term.

Alternatively, a second order equation may be used to determine whether frame or field DCT should be used by
modifying the above equation to take the square of each error term rather than the absolute value. In this case, the 
"bias" term is not required. 

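The field/frame DCT decision described above can be sketched in Python as follows. This is a hedged illustration,
not a definitive implementation: it assumes a 16x16 luminance residue p indexed as p[line][column], with the sums
running over i = 0..6 and j = 0..15, opposite-field terms |p[2i][j] - p[2i+1][j]| + |p[2i+1][j] - p[2i+2][j]| and
same-field terms |p[2i][j] - p[2i+2][j]| + |p[2i+1][j] - p[2i+3][j]|. The function name select_field_dct and the
second_order flag are hypothetical.

```python
def select_field_dct(p, bias=64, second_order=False):
    """Return True if field DCT should be selected for the 16x16 residue p."""
    # First-order criterion uses absolute differences plus a bias term;
    # the second-order variant squares each term and needs no bias.
    cost = (lambda d: d * d) if second_order else abs
    if second_order:
        bias = 0
    opposite_field = 0
    same_field = 0
    for i in range(7):
        for j in range(16):
            opposite_field += cost(p[2 * i][j] - p[2 * i + 1][j])
            opposite_field += cost(p[2 * i + 1][j] - p[2 * i + 2][j])
            same_field += cost(p[2 * i][j] - p[2 * i + 2][j])
            same_field += cost(p[2 * i + 1][j] - p[2 * i + 3][j])
    return opposite_field > same_field + bias
```

A residue whose even and odd lines differ strongly (as with interlaced motion) selects field DCT, whereas a flat
residue or a smooth vertical gradient falls back to frame DCT.
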
FIGURE 5 is an illustration of spatial and temporal scaling of a VOP in accordance with the present invention. With
object-based scalability, the frame rate and spatial resolution of a selected VOP can be enhanced such that it has a
higher quality than the remaining area, e.g., the frame rate and/or spatial resolution of the selected object can be higher
than that of the remaining area. For example, a VOP of a news announcer may be provided with a higher resolution than
a studio backdrop.

Axes 505 and 506 indicate a frame number. In the base layer, frame 510 which includes VOP 520 is provided in the
frame 0 position, while frame 530 with VOP 532 (corresponding to VOP 520) is provided in the frame 3 position.
Furthermore, frame 530 is predicted from frame 510, as shown by arrow 512. The enhancement layer includes VOPs 522,
524, 526 and 542. These VOPs have an increased spatial resolution relative to VOPs 520 and 532 and therefore are
drawn with a larger area.

P-VOP 522 is derived from upsampling VOP 520, as shown by arrow 570. B-VOPs 524 and 526 are predicted from
base layer VOPs 520 and 532, as shown by arrows 572 and 576, and 574 and 578, respectively.

The input video sequence which is used to create the base and enhancement layer sequences has full resolution
(e.g., 720x480 for ITU-R 601 corresponding to National Television Standards Committee (NTSC) or 720x576 for ITU-R 601
corresponding to Phase Alternation Line (PAL)) and full frame rate (30 frames/60 fields for ITU-R 601 corresponding to
NTSC or 25 frames/50 fields for ITU-R 601 corresponding to PAL). Scaleable coding is performed such that the
resolution and frame rate of objects are preserved by using the enhancement layer coding. The video object in the base layer,
comprising VOPs 520 and 532, has a lower resolution (e.g., quarter size of the full resolution VOP) and a lower frame
rate (e.g., one third of the original frame rate). Moreover, in the enhancement layer, only the VOP 520 is enhanced. The
remainder of the frame 510 is not enhanced. While only one VOP is shown, virtually any number of VOPs may be
provided. Moreover, when two or more VOPs are provided, all or only selected ones may be enhanced.

The base layer sequence is generated by downsampling and frame-dropping of the original sequence. The base
layer VOPs are then coded as I-VOPs or P-VOPs by using progressive coding tools. When the input video sequence is
interlaced, interlaced coding tools such as field/frame motion estimation and compensation, and field/frame DCT are
not used since downsampling of the input interlaced video sequence produces a progressive video sequence. The
enhancement layer VOPs are coded using temporal and spatial scaleable tools. For example, in the enhancement layer,
VOP 522 and VOP 542 are coded as P-VOPs using spatial scalability. VOP 524 and VOP 526 are coded as B-VOPs
from the upsampled VOPs of the base layer reference VOPs, i.e., VOP 520 and VOP 532, respectively, using temporal
scaleable tools.

In a further aspect of the present invention, a technique is disclosed for reducing encoding complexity for motion
estimation of B-VOPs by reducing the motion vector search range. The technique is applicable to both frame mode and
field mode input video sequences. In particular, the searching center of the reference VOP is determined by scaling the
motion vector of the corresponding base layer VOP rather than by performing an independent exhaustive search in the
reference VOP. Such an exhaustive search would typically cover a range, for example, of +/- 64 pixels horizontally, and
+/- 46 pixels vertically, and would therefore be less efficient than the disclosed technique.
The searching center for motion vectors of B-VOPs 524 and 526 in the enhancement layer is determined by:

$$MV_f = \frac{(m/n) \cdot TR_B \cdot MV_p}{TR_P}$$

$$MV_b = \frac{(m/n) \cdot (TR_B - TR_P) \cdot MV_p}{TR_P}$$

where MV_f is the forward motion vector, MV_b is the backward motion vector, MV_p is the motion vector for the P-VOP
(e.g., VOP 532) in the base layer, TR_B is the temporal distance between the past reference VOP (e.g., VOP 520) and
the current B-VOP in the enhancement layer, and TR_P is the temporal distance between the past reference VOP and
the future reference P-VOP (e.g., VOP 532) in the base layer. m/n is the ratio of the spatial resolution of the base layer
VOPs to the spatial resolution of the enhancement layer VOPs. That is, either the base layer VOPs or the B-VOP in the
enhancement layer may be downsampled relative to the input video sequence by a ratio m/n. In the example of FIGURE
5, m/n is the downsampling ratio of the base layer VOP which is subsequently upsampled to provide the enhancement
layer VOP. m/n may be less than, equal to, or greater than one. For example, for B-VOP 524, TR_B=1, TR_P=3, and 2:1
downsampling (i.e., m/n=2), we have MV_f=2/3 MV_p, and MV_b=-4/3 MV_p. Note that all of the motion vectors are
two-dimensional. The motion vector searching range is a 16x16 rectangular region, for example, whose center is
determined by MV_f and MV_b. The motion vectors are communicated with the enhancement and base layer video data in a
transport data stream, and are recovered by a decoder for use in decoding the video data.
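
The scaling of the base layer motion vector into forward and backward search centers can be sketched as follows.
This Python fragment is an illustration under stated assumptions rather than the codec's implementation: the function
name scaled_search_centers and its argument names are hypothetical, motion vectors are represented as (x, y) tuples,
and exact rational arithmetic is used so that no particular rounding policy is implied.

```python
from fractions import Fraction


def scaled_search_centers(mv_p, tr_b, tr_p, m, n):
    """Scale the base layer P-VOP motion vector mv_p into (MV_f, MV_b).

    MV_f = (m/n) * TR_B * MV_p / TR_P
    MV_b = (m/n) * (TR_B - TR_P) * MV_p / TR_P
    """
    if tr_p == 0:
        raise ValueError("TR_P must be non-zero")
    ratio = Fraction(m, n)
    mv_f = tuple(ratio * tr_b * c / tr_p for c in mv_p)
    mv_b = tuple(ratio * (tr_b - tr_p) * c / tr_p for c in mv_p)
    return mv_f, mv_b


# B-VOP 524 of the example: TR_B=1, TR_P=3, m/n=2, with an assumed MV_p of (3, -6).
mv_f, mv_b = scaled_search_centers((3, -6), tr_b=1, tr_p=3, m=2, n=1)
```

With these values MV_f is 2/3 of MV_p, i.e. (2, -4), and MV_b is -4/3 of MV_p, i.e. (-4, 8), in agreement with the
factors given for B-VOP 524.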

Generally, for coding of interlaced video in accordance with the present invention, interlaced coding tools are used
to achieve better performance. These tools include Field/Frame DCT for intra-macroblocks and inter-difference
macroblocks, and field prediction, i.e., top field to bottom field, top field to top field, bottom field to top field and bottom field to
bottom field.

For the configurations described in Table 1, above, these interlaced coding tools are combined as follows.

(1) For the configurations with low spatial resolution for both layers, only progressive (frame mode) coding tools are
used. In this case, the two layers will code different view sequences, for example, in a stereoscopic video signal.

For coding stereoscopic video, the motion estimation search range for the right-view (enhancement layer) 
sequence is 8x8 pixels. This 8x8 (full-pixel) search area is centered around the same-type motion vectors of a cor- 
responding macroblock in the base layer of the corresponding VOP. 

(2) For the configurations with low spatial resolution in the base layer and high spatial resolution in the
enhancement layer, interlaced coding tools will only be used for the enhancement layer sequences. The motion estimation
search range for coding the enhancement layer sequence is 8x8 (full-pixel). This 8x8 search area is centered
around the re-scaled (i.e., a factor of two) same-type motion vectors of a corresponding macroblock in the base layer
of the corresponding VOP. Field based estimation and prediction will be used only in the enhancement layer search
and compensation.

(3) For the configurations with high spatial resolution in the base layer and low spatial resolution in the
enhancement layer, interlaced coding tools will only be used for the base layer sequences, as with the MPEG-2 Main Profile
at the Main Level. The motion estimation search range for coding the enhancement layer sequence is 4x4
(full-pixel). This 4x4 search is centered around the re-scaled (i.e., a factor of 1/2) same-type motion vectors of a
corresponding macroblock in the base layer of the corresponding VOP. For configuration 2 in Table 1, above, for
example, the coding of the sequences of two layers has a different temporal unit rate.
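
The reduced search areas of cases (1) to (3) can be sketched as a window placed around the re-scaled base layer
motion vector. The Python fragment below is only an illustration: the text does not state how an NxN full-pixel
area is positioned about its center, so the sketch assumes offsets from -N/2 to N/2-1 on each axis, and the function
name search_window and its arguments are hypothetical.

```python
def search_window(base_mv, scale, size):
    """Return inclusive full-pixel bounds ((x_min, x_max), (y_min, y_max)).

    base_mv: (x, y) motion vector of the corresponding base layer macroblock.
    scale:   re-scaling factor (e.g., 1 for case (1), 2 for case (2), 0.5 for case (3)).
    size:    side of the square search area in pixels (e.g., 8 or 4).
    """
    # Assumption: the scaled vector is rounded to the nearest full-pixel position.
    cx = round(base_mv[0] * scale)
    cy = round(base_mv[1] * scale)
    half = size // 2
    return (cx - half, cx + half - 1), (cy - half, cy + half - 1)
```

For case (2), a base layer vector of (3, -2) re-scaled by two gives a center of (6, -4), so the 8x8 area spans
columns 2 to 9 and rows -8 to -1 under the stated assumption.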

FIGURE 7 is an illustration of a picture-in-picture (PIP) or preview channel access application with spatial and
temporal scaling in accordance with the present invention. With PIP, a secondary program is provided as a subset of a main
program which is viewed on the television. Since the secondary program has a smaller area, the viewer is less
discerning of a reduced resolution image, so the temporal and/or spatial resolution of the PIP image can be reduced to
conserve bandwidth.

Similarly, a preview access channel program may provide a viewer with a free low-resolution sample of a program
which may be purchased for a fee. This application provides a few minutes of free access to an authorized channel
(e.g., Pay-Per-View) for a preview. Video coded in the preview access channel will have lower resolution and lower
frame rate. The decoder will control the access time for such a preview channel.

Configuration 2 of the temporal-spatial scaleable coding in Table 1, above, may be used to provide an output from
decoding the base layer that has a lower spatial resolution than the output from decoding both the base layer and the 
enhancement layer. The video sequence in the base layer can be coded with a low frame rate, while the enhancement 
layer is coded with a higher frame rate. 

For example, a video sequence in the base layer can have a CIF resolution and a frame rate of 15 frames/second, 
while the corresponding video sequence in the enhancement layer has an ITU-R 601 resolution and a frame rate of 30 
frames/second. In this case, the enhancement layer may conform to the NTSC video standard, while PIP or preview 
access functionality is provided by the base layer, which may conform to a CIF standard. Accordingly, PIP functionality 
can be provided by scaleable coding with a similar coding complexity and efficiency as the MPEG-2 Main Profile at Main 
Level standard. 

The base layer includes low spatial resolution VOPs 705 and 730. Moreover, the temporal resolution of the base
layer is 1/3 that of the enhancement layer. The enhancement layer includes high spatial resolution VOPs 750, 760, 780
and 790. P-VOP 750 is derived by upsampling I-VOP 705, as shown by arrow 755. B-VOP 760 is predicted from the
base layer VOPs as shown by arrows 765 and 775. B-VOP 780 is predicted from the base layer VOPs as shown by
arrows 770 and 785. P-VOP 790 is derived by upsampling P-VOP 730, as shown by arrow 795.

FIGURE 8 is an illustration of a stereoscopic video application in accordance with the present invention. Stereo- 
scopic video functionality is provided in the MPEG-2 Multi-view Profile (MVP) system, described in document ISO/IEC 
JTC1/SC29/WG11 N1196. The base layer is assigned to the left view and the enhancement layer is assigned to the
right view. 

To improve coding efficiency, the enhancement layer pictures can be coded with a lower resolution than the base
layer. For example, configuration 4 in Table 1, above, can be used where the base layer has an ITU-R 601 spatial
resolution, while the enhancement layer has a CIF spatial resolution. The reference pictures of the base layer for the
prediction of the enhancement layer pictures are downsampled. Accordingly, the decoder for the enhancement layer pictures
includes an upsampling process. Additionally, adaptive frame/field DCT coding is used in the base layer but not the
enhancement layer.

The base layer includes VOPs 805, 815, 820 and 830, while the enhancement layer includes VOPs 850, 860, 880
and 890. B-VOPs 815 and 820 are predicted using other base layer VOPs as shown by arrows 810, 840, and 835, 825,
respectively. P-VOP 830 is predicted from I-VOP 805 as shown by arrow 845. P-VOP 850 is derived by downsampling
I-VOP 805, as shown by arrow 855. B-VOP 860 is predicted from the base layer VOPs as shown by arrows 865 and 875.
B-VOP 880 is predicted from the base layer VOPs as shown by arrows 870 and 885. P-VOP 890 is derived by
downsampling P-VOP 830, as shown by arrow 895.

Alternatively, for the base and enhancement layers to have the same spatial resolution and frame rate,
configuration 7 in Table 1, above, may be used. In this case, the coding process of the base layer may be the same as a
non-scaleable encoding process, e.g., as described in the MPEG-4 VM non-scaleable coding or MPEG-2 Main Profile
at Main Level standard, while adaptive frame/field DCT coding is used in the enhancement layer.

In a further application of the present invention, an asynchronous transfer mode (ATM) communication technique 
is presented. Generally, the trend towards transmission of video signals over ATM networks is rapidly growing. This is 
due to the variable bit rate (VBR) nature of these networks which provides several advantages over constant bit rate 
(CBR) transmissions. For example, in VBR channels, an approximately constant picture quality can be achieved. More- 
over, video sources in ATM networks can be statistically multiplexed, requiring a lower transmission bit rate than if they 
are transmitted through CBR channels since the long term average data rate of a video signal is less than the short term 
average due to elastic buffering in CBR systems. 

However, despite the advantages of ATM networks, they suffer from a major deficiency of congestion. In congested 
networks, video packets are queued to find an outgoing route. Long-delayed packets may arrive too late to be of any 
use in the receiver, and consequently are thrown away by the decoder. The video codec then must be designed to with- 
stand packet losses. 

In order to make the video coder almost immune to packet losses, the temporal-spatial scaleable coding techniques 
of the present invention can be used. In particular, video data from the base layer can be transmitted with a high priority 
and accommodated in a guaranteed bit rate of an ATM network. Video data packets from the enhancement layer may 
be lost if congestion arises since a channel is not guaranteed. If the enhancement layer packets are received, picture 
quality is improved. A coding scheme using configuration 1 of Table 1, above, may be used to achieve this result. The 
scheme may be achieved as shown in FIGURE 4, discussed previously in connection with prediction modes, where the 
base layer is the high-priority layer. Thus, higher priority, lower bit rate data is communicated in the base layer, and lower
priority, higher bit rate data is communicated in the enhancement layer. 

Similarly, such scaleable coding can also be used in video coding and transmission over the Internet, intranets and
other communication networks.

Accordingly, it can be seen that the present invention provides a method and apparatus for providing temporal and
spatial scaling of video images including video object planes (VOPs) in a digital video sequence. In one aspect of the
invention, coding efficiency is improved by adaptively compressing a scaled field mode input video sequence.
Upsampled VOPs in the enhancement layer are reordered to provide a greater correlation with the original video sequence
based on a linear criterion. The resulting residue is coded using a spatial transformation such as the DCT. In another
aspect of the invention, a motion compensation scheme is presented for coding enhancement layer VOPs by scaling
motion vectors which have already been determined for the base layer VOPs. A reduced search area is defined whose
center is defined by the scaled motion vectors. The technique is suitable for use with a scaled frame mode or field mode
input video sequence.

Additionally, various codec processor configurations were presented to achieve particular scaleable coding results. 
Applications of scaleable coding, including stereoscopic video, picture-in-picture, preview access channels, and ATM 
communications, were also discussed. 

Although the invention has been described in connection with various specific embodiments, those skilled in the art 
will appreciate that numerous adaptations and modifications may be made thereto without departing from the spirit and 
scope of the invention as set forth in the claims. For example, while two scalability layers were discussed, more than 
two layers may be provided. Moreover, while rectangular or square VOPs may have been provided in some of the fig- 
ures for simplicity, the invention is equally suitable for use with arbitrarily-shaped VOPs. 

Claims 

1. A method for scaling an input video sequence comprising video object planes (VOPs) for communication in a cor- 
responding base layer and enhancement layer, said VOPs in said input video sequence having an associated spa- 
tial resolution and temporal resolution, comprising the steps of: 

downsampling pixel data of a first particular one of said VOPs of said input video sequence to provide a first 
base layer VOP having a reduced spatial resolution; 

upsampling pixel data of at least a portion of said first base layer VOP to provide a first upsampled VOP in said 
enhancement layer; and 

differentially encoding said first upsampled VOP using said first particular one of said VOPs of said input video 
sequence for communication in said enhancement layer at a temporal position corresponding to said first base 
layer VOP. 

2. The method of claim 1, wherein said VOPs in said input video sequence are field mode VOPs, and said differentially
encoding step comprises the further steps of:

reordering lines of said pixel data of said first upsampled VOP in a field mode if said lines of pixel data meet a
reordering criteria; then 

determining a residue according to a difference between pixel data of said first upsampled VOP and pixel data 
of said first particular one of said VOPs of said input video sequence; 
and spatially transforming said residue to provide transform coefficients. 

3. The method of claim 2, wherein: 

said lines of pixel data of said first upsampled VOP meet said reordering criteria when a sum of differences of 
luminance values of opposite-field lines is greater than a sum of differences of luminance data of same-field 
lines and a bias term. 

4. The method of any of the preceding claims, comprising the further steps of: 

downsampling pixel data of a second particular one of said VOPs of said input video sequence to provide a 
second base layer VOP having a reduced spatial resolution; 

upsampling pixel data of at least a portion of said second base layer VOP to provide a second upsampled VOP 
in said enhancement layer which corresponds to said first upsampled VOP; 

using at least one of said first and second base layer VOPs to predict an intermediate VOP corresponding to 
said first and second upsampled VOPs; and 

encoding said intermediate VOP for communication in said enhancement layer at a temporal position which is
intermediate to that of said first and second upsampled VOPs.

5. The method of claim 4, wherein:

said enhancement layer has a higher temporal resolution than said base layer; and 
said base and enhancement layer are adapted to provide at least one of: 

(a) a picture-in-picture (PIP) capability wherein a PIP image is carried in said base layer, and 

(b) a preview access channel capability wherein a preview access image is carried in said base layer. 

6. The method of any of the preceding claims, wherein: 

said base layer is adapted to carry higher priority, lower bit rate data, and said enhancement layer is adapted
to carry lower priority, higher bit rate data. 

7. A method for scaling an input video sequence comprising video object planes (VOPs) for communication in a cor- 
responding base layer and enhancement layer, said VOPs in said input video sequence having an associated spa- 
tial resolution and temporal resolution, comprising the steps of: 

providing a first particular one of said VOPs of said input video sequence for communication in said base layer 
as a first base layer VOP;

downsampling pixel data of at least a portion of said first base layer VOP for communication in said enhance- 
ment layer as a first downsampled VOP at a temporal position corresponding to said first base layer VOP; 
downsampling corresponding pixel data of said first particular one of said VOPs to provide a comparison VOP; 
and 

differentially encoding said first downsampled VOP using said comparison VOP.

8. The method of claim 7, comprising the further steps of: 

differentially encoding said first base layer VOP using said first particular one of said VOPs by:

determining a residue according to a difference between pixel data of said first base layer VOP and pixel data
of said first particular one of said VOPs;

and spatially transforming said residue to provide transform coefficients. 

9. The method of claim 8, wherein said VOPs in said input video sequence are field mode VOPs, and said first base 
layer VOP is differentially encoded by the steps of: 

reordering lines of said pixel data of said first base layer VOP in a field mode prior to said determining step if 
said lines of pixel data meet a reordering criteria. 

10. The method of claim 9, wherein: 

said lines of pixel data of said first base layer VOP meet said reordering criteria when a sum of differences of 
luminance values of opposite-field lines is greater than a sum of differences of luminance data of same-field 
lines and a bias term. 

11. The method of any of claims 7 to 10, comprising the further steps of:

providing a second particular one of said VOPs of said input video sequence for communication in said base 
layer as a second base layer VOP; 

downsampling pixel data of at least a portion of said second base layer VOP for communication in said 
enhancement layer as a second downsampled VOP at a temporal position corresponding to said second base 
layer VOP; 

downsampling corresponding pixel data of said second particular one of said VOPs to provide a comparison 
VOP; 

differentially encoding said second downsampled VOP using said comparison VOP;

using at least one of said first and second base layer VOPs to predict an intermediate VOP corresponding to
said first and second downsampled VOPs; and

encoding said intermediate VOP for communication in said enhancement layer at a temporal position which is
intermediate to that of said first and second upsampled VOPs.

12. The method of any of claims 7 to 11, wherein:

the base and enhancement layers are adapted to provide a stereoscopic video capability in which image data 
in the enhancement layer has a lower spatial resolution than image data in the base layer. 

13. A method for coding a bi-directionally predicted video object plane (B-VOP), comprising the steps of:

scaling an input video sequence comprising video object planes (VOPs) for communication in a corresponding
base layer and enhancement layer;

providing first and second base layer VOPs in said base layer which correspond to said input video sequence 
VOPs; 

said second base layer VOP being predicted from said first base layer VOP according to a motion vector MV_p;
providing said B-VOP in said enhancement layer at a temporal position which is intermediate to that of said first 
and second base layer VOPs; and 
encoding said B-VOP using at least one of: 

(a) a forward motion vector MV_f, and

(b) a backward motion vector MV_b obtained by scaling said motion vector MV_p.

14. The method of claim 13, wherein: 

a temporal distance TR_P separates said first and second base layer VOPs;
a temporal distance TR_B separates said first base layer VOP and said B-VOP;

m/n is a ratio of the spatial resolution of the first and second base layer VOPs to the spatial resolution of the
B-VOP; and
at least one of:

(a) said forward motion vector MV_f is determined according to the relationship
MV_f=(m/n) • TR_B • MV_p/TR_P; and

(b) said backward motion vector MV_b is determined according to the relationship
MV_b=(m/n) • (TR_B-TR_P) • MV_p/TR_P.

15. The method of claim 13 or 14, comprising the further step of: 

encoding said B-VOP using at least one of: 

(a) a search region of said first base layer VOP whose center is determined according to said forward
motion vector MV_f; and

(b) a search region of said second base layer VOP whose center is determined according to said backward
motion vector MV_b.

16. A method for recovering an input video sequence comprising video object planes (VOPs) which was scaled and 
communicated in a corresponding base layer and enhancement layer, said VOPs in said input video sequence hav- 
ing an associated spatial resolution and temporal resolution, wherein: 

pixel data of a first particular one of said VOPs of said input video sequence is downsampled and carried as a 
first base layer VOP having a reduced spatial resolution; 

pixel data of at least a portion of said first base layer VOP is upsampled and carried as a first upsampled VOP 
in said enhancement layer at a temporal position corresponding to said first base layer VOP; and 
said first upsampled VOP is differentially encoded using said first particular one of said VOPs of said input
video sequence; 

said method comprising the steps of: 

upsampling said pixel data of said first base layer VOP to restore said associated spatial resolution; and 
processing said first upsampled VOP and said first base layer VOP with said restored associated spatial 
resolution to provide an output video signal with said associated spatial resolution. 

17. The method of claim 16, wherein:

said VOPs in said input video sequence are field mode VOPs; and

said first upsampled VOP is differentially encoded by reordering lines of said pixel data of said first upsampled 
VOP in a field mode if said lines of pixel data meet a reordering criteria, then determining a residue according 
to a difference between pixel data of said first upsampled VOP and pixel data of said first particular one of said 
VOPs of said input video sequence, and spatially transforming said residue to provide transform coefficients. 
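
The residue and spatial transform steps can be sketched as follows, assuming an orthonormal 8x8 DCT as the spatial transform and a residue taken as original minus upsampled. The sign convention, the block contents, and the helper names are illustrative assumptions only.

```python
import math


def residue(a, b):
    """Pixel-wise difference a - b between two equally sized blocks."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def dct_1d(v):
    """Orthonormal DCT-II of a one-dimensional sequence."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(
            x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, x in enumerate(v)
        ))
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


# Assumed example: the original block is uniformly 4 levels brighter.
original = [[104] * 8 for _ in range(8)]
upsampled = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(residue(original, upsampled))
print(round(coeffs[0][0], 6))
```

For a constant residue all of the energy lands in the DC coefficient, which is why a well-predicted block is cheap to code.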

18. The method of claim 17, wherein:

said lines of pixel data of said first upsampled VOP meet said reordering criteria when a sum of differences of 
luminance values of opposite-field lines is greater than a sum of differences of luminance data of same-field 
lines and a bias term. 
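
A minimal sketch of this reordering test is given below. It assumes absolute differences, compares each line with its immediate neighbour (opposite field) and with the line two rows away (same field), and reorders by placing the even lines before the odd lines. These details and the names used are assumptions; the claim only states the comparison of the two sums and the bias term.

```python
def needs_field_reordering(block, bias=0):
    """Return True when the opposite-field sum exceeds same-field sum + bias."""
    opposite = 0
    same = 0
    for i in range(len(block) - 2):
        for j in range(len(block[i])):
            # Adjacent lines belong to opposite fields.
            opposite += abs(block[i][j] - block[i + 1][j])
            # Lines two apart belong to the same field.
            same += abs(block[i][j] - block[i + 2][j])
    return opposite > same + bias


def reorder_to_field(block):
    """Group the even (top field) lines first, then the odd (bottom field) lines."""
    return block[0::2] + block[1::2]


# Assumed example: strong inter-field motion makes alternate lines differ.
interlaced = [[0] * 4 if i % 2 == 0 else [100] * 4 for i in range(8)]
flat = [[50] * 4 for _ in range(8)]
print(needs_field_reordering(interlaced), needs_field_reordering(flat))
```

The bias term makes the test favour the frame-mode ordering unless the field-mode ordering is clearly better.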

19. The method of any of claims 16 to 18, wherein: 

a second particular one of said VOPs of said input video sequence is downsampled to provide a second base 
layer VOP having a reduced spatial resolution; 

pixel data of at least a portion of said second base layer VOP is upsampled to provide a second upsampled 
VOP in said enhancement layer which corresponds to said first upsampled VOP; 

at least one of said first and second base layer VOPs is used to predict an intermediate VOP corresponding to 
said first and second upsampled VOPs; and 

said intermediate VOP is encoded for communication in said enhancement layer at a temporal position which 
is intermediate to that of said first and second upsampled VOPs. 

20. The method of claim 19, wherein: 

said enhancement layer has a higher temporal resolution than said base layer; and 
said base and enhancement layer are adapted to provide at least one of: 

(a) a picture-in-picture (PIP) capability wherein a PIP image is carried in said base layer, and 

(b) a preview access channel capability wherein a preview access image is carried in said base layer. 

21. The method of any of claims 16 to 20, wherein: 

said base layer is adapted to carry higher priority, lower bit rate data, and said enhancement layer is adapted 
to carry lower priority, higher bit rate data. 

22. A method for recovering an input video sequence comprising video object planes (VOPs) which was scaled and 
communicated in a corresponding base layer and enhancement layer, said VOPs in said input video sequence hav- 
ing an associated spatial resolution and temporal resolution, wherein: 

a first particular one of said VOPs of said input video sequence is provided in said base layer as a first base 
layer VOP; 

pixel data of at least a portion of said first base layer VOP is downsampled and carried in said enhancement 
layer as a first downsampled VOP at a temporal position corresponding to said first base layer VOP; 
corresponding pixel data of said first particular one of said VOPs is downsampled to provide a comparison 
VOP; and 

said first downsampled VOP is differentially encoded using said comparison VOP; 
said method comprising the steps of: 

upsampling said pixel data of said first downsampled VOP to restore said associated spatial resolution; 
and 

processing said first enhancement layer VOP with said restored associated spatial resolution and said first 
base layer VOP to provide an output video signal with said associated spatial resolution. 

23. The method of claim 22, wherein: 

said first base layer VOP is differentially encoded using said first particular one of said VOPs by determining
a residue according to a difference between pixel data of said first base layer VOP and pixel data of said first
particular one of said VOPs, and spatially transforming said residue to provide transform coefficients.

24. The method of claim 23, wherein: 

said VOPs in said input video sequence are field mode VOPs, and said first base layer VOP is differentially 
encoded by reordering lines of said pixel data of said first base layer VOP in a field mode prior to determining 
said residue if said lines of pixel data meet a reordering criteria.

25. The method of claim 24, wherein: 

said lines of pixel data of said first base layer VOP meet said reordering criteria when a sum of differences of 
luminance values of opposite-field lines is greater than a sum of differences of luminance data of same-field
lines and a bias term. 

26. The method of any of claims 22 to 25, wherein:

a second particular one of said VOPs of said input video sequence is provided in said base layer as a second 
base layer VOP; 

pixel data of at least a portion of said second base layer VOP is downsampled and carried in said enhancement 
layer as a second downsampled VOP at a temporal position corresponding to said second base layer VOP; 
corresponding pixel data of said second particular one of said VOPs is downsampled to provide a comparison 
VOP; 

said second downsampled VOP is differentially encoded using said comparison VOP; 

at least one of said first and second base layer VOPs is used to predict an intermediate VOP corresponding to
said first and second upsampled VOPs; and

said intermediate VOP is encoded for communication in said enhancement layer at a temporal position which 
is intermediate to that of said first and second upsampled VOPs. 

27. The method of any of claims 22 to 26, wherein: 

said base and enhancement layer are adapted to provide a stereoscopic video capability in which image data 
in said enhancement layer has a lower spatial resolution than image data in said base layer. 

28. A method for recovering an input video sequence comprising video object planes (VOPs) which was scaled and 
communicated in a corresponding base layer and enhancement layer in a data stream, said VOPs in said input 
video sequence having an associated spatial resolution and temporal resolution, wherein: 

first and second base layer VOPs are provided in said base layer which correspond to said input video 
sequence VOPs; 

said second base layer VOP is predicted from said first base layer VOP according to a motion vector MV p ; 

a bi-directionally predicted video object plane (B-VOP) is provided in said enhancement layer at a temporal
position which is intermediate to that of said first and second base layer VOPs; and

said B-VOP is encoded using a forward motion vector MV f and a backward motion vector MV B which are
obtained by scaling said motion vector MV p ;

said method comprising the steps of: 

recovering said forward motion vector MV f and said backward motion vector MV B from said data stream;
and 

decoding said B-VOP using said forward motion vector MV f and said backward motion vector MV B . 

29. The method of claim 28, wherein: 

a temporal distance TRp separates said first and second base layer VOPs; 
a temporal distance TR B separates said first base layer VOP and said B-VOP; 

m/n is a ratio of the spatial resolution of the first and second base layer VOPs to the spatial resolution of the B-
VOP; and

at least one of: 

(a) said forward motion vector MV f is determined according to the relationship
MV f =(m/n) • TR B • MV p /TR p ; and

(b) said backward motion vector MV b is determined according to the relationship 
MV b =(m/n) • (TR B -TR p ) • MV p /TR p .

30. The method of claim 28 or 29, wherein: 

said B-VOP is encoded using at least one of:

(a) a search region of said first base layer VOP whose center is determined according to said forward 
motion vector MV f ; and

(b) a search region of said second base layer VOP whose center is determined according to said backward 
motion vector MV B . 

31. A decoder apparatus for recovering an input video sequence comprising video object planes (VOPs) which was
scaled and communicated in a corresponding base layer and enhancement layer, said VOPs in said input video 
sequence having an associated spatial resolution and temporal resolution, wherein: 

pixel data of a first particular one of said VOPs of said input video sequence is downsampled and carried as a 
first base layer VOP having a reduced spatial resolution; 

pixel data of at least a portion of said first base layer VOP is upsampled and carried as a first upsampled VOP
in said enhancement layer at a temporal position corresponding to said first base layer VOP; and

said first upsampled VOP is differentially encoded using said first particular one of said VOPs of said input
video sequence;

said apparatus comprising: 

means for upsampling said pixel data of said first base layer VOP to restore said associated spatial reso- 
lution; and 

means for processing said first upsampled VOP and said first base layer VOP with said restored associ- 
ated spatial resolution to provide an output video signal with said associated spatial resolution. 

32. The apparatus of claim 31, wherein:

said VOPs in said input video sequence are field mode VOPs; and 

said first upsampled VOP is differentially encoded by reordering lines of said pixel data of said first upsampled 
VOP in a field mode if said lines of pixel data meet a reordering criteria, then determining a residue according 
to a difference between pixel data of said first upsampled VOP and pixel data of said first particular one of said 
VOPs of said input video sequence, and spatially transforming said residue to provide transform coefficients. 

33. The apparatus of claim 31 or 32, wherein:

said lines of pixel data of said first upsampled VOP meet said reordering criteria when a sum of differences of 
luminance values of opposite-field lines is greater than a sum of differences of luminance data of same-field 
lines and a bias term. 

34. The apparatus of any of claims 31 to 33, wherein: 

a second particular one of said VOPs of said input video sequence is downsampled to provide a second base 
layer VOP having a reduced spatial resolution; 

pixel data of at least a portion of said second base layer VOP is upsampled to provide a second upsampled 
VOP in said enhancement layer which corresponds to said first upsampled VOP; 

at least one of said first and second base layer VOPs is used to predict an intermediate VOP corresponding to 
said first and second upsampled VOPs; and 

said intermediate VOP is encoded for communication in said enhancement layer at a temporal position which 
is intermediate to that of said first and second upsampled VOPs. 

35. The apparatus of claim 34, wherein: 

said enhancement layer has a higher temporal resolution than said base layer; and
said base and enhancement layers are adapted to provide at least one of: 

(a) a picture-in-picture (PIP) capability wherein a PIP image is carried in said base layer, and 

(b) a preview access channel capability wherein a preview access image is carried in said base layer. 

36. The apparatus of any of claims 31 to 35, wherein:

said base layer is adapted to carry higher priority, lower bit rate data, and said enhancement layer is adapted 
to carry lower priority, higher bit rate data. 

37. A decoder apparatus for recovering an input video sequence comprising video object planes (VOPs) which was 
scaled and communicated in a corresponding base layer and enhancement layer, said VOPs in said input video 
sequence having an associated spatial resolution and temporal resolution, wherein: 

a first particular one of said VOPs of said input video sequence is provided in said base layer as a first base 
layer VOP; 

pixel data of at least a portion of said first base layer VOP is downsampled and carried in said enhancement 
layer as a first downsampled VOP at a temporal position corresponding to said first base layer VOP; 
corresponding pixel data of said first particular one of said VOPs is downsampled to provide a comparison 
VOP; and 

said first downsampled VOP is differentially encoded using said comparison VOP; 
said apparatus comprising: 

means for upsampling said pixel data of said first downsampled VOP to restore said associated spatial res- 
olution; and 

means for processing said first enhancement layer VOP with said restored spatial resolution and said first 
base layer VOP to provide an output video signal with said associated spatial resolution. 

38. The apparatus of claim 37, wherein: 

said first downsampled VOP is differentially encoded by determining a residue according to a difference
between pixel data of said first downsampled VOP and pixel data of said first particular one of said VOPs of 
said input video sequence, and spatially transforming said residue to provide transform coefficients. 

39. The apparatus of claim 38, wherein: 

said VOPs in said input video sequence are field mode VOPs, and said first base layer VOP is differentially 
encoded by reordering lines of said pixel data of said first base layer VOP in a field mode prior to determining 
said residue if said lines of pixel data meet a reordering criteria. 

40. The apparatus of claim 39, wherein: 

said lines of pixel data of said first base layer VOP meet said reordering criteria when a sum of differences of 
luminance values of opposite-field lines is greater than a sum of differences of luminance data of same-field 
lines and a bias term. 

41. The apparatus of any of claims 37 to 40, wherein: 

a second particular one of said VOPs of said input video sequence is provided for communication in said base 
layer as a second base layer VOP; 

pixel data of at least a portion of said second base layer VOP is downsampled to provide a second downsam- 
pled VOP in said enhancement layer which corresponds to said first upsampled VOP; 
at least one of said first and second base layer VOPs is used to predict an intermediate VOP corresponding to 
said first and second downsampled VOPs; and 

said intermediate VOP is encoded for communication in said enhancement layer at a temporal position which 
is intermediate to that of said first and second upsampled VOPs. 

42. The apparatus of any of claims 37 to 41, wherein:

said base and enhancement layer are adapted to provide a stereoscopic video capability in which image data 
in said enhancement layer has a lower spatial resolution than image data in said base layer. 

43. A decoder apparatus for recovering an input video sequence comprising video object planes (VOPs) which was
scaled and communicated in a corresponding base layer and enhancement layer in a data stream, said VOPs in 
said input video sequence having an associated spatial resolution and temporal resolution, wherein: 

first and second base layer VOPs which correspond to said input video sequence VOPs are provided in said 
base layer; 

said second base layer VOP is predicted from said first base layer VOP according to a motion vector MV p ; 

a bi-directionally predicted video object plane (B-VOP) is provided in said enhancement layer at a temporal
position which is intermediate to that of said first and second base layer VOPs; and

said B-VOP is encoded using a forward motion vector MV f and a backward motion vector MV B which are
obtained by scaling said motion vector MV p ;

said apparatus comprising: 

means for recovering said forward motion vector MV f and said backward motion vector MV B from said data 
stream; and 

means for decoding said B-VOP using said forward motion vector MV f and said backward motion vector
MV B . 

44. The apparatus of claim 43, wherein: 

a temporal distance TRp separates said first and second base layer VOPs; 
a temporal distance TR B separates said first base layer VOP and said B-VOP; 

m/n is a ratio of the spatial resolution of the first and second base layer VOPs to the spatial resolution of the B-
VOP; and

at least one of: 

(a) said forward motion vector MV f is determined according to the relationship 
MV f =(m/n) • TR B • MV p /TR p ; and

(b) said backward motion vector MV b is determined according to the relationship 
MV b =(m/n) • (TR B -TR p ) • MV p /TR p .

45. The apparatus of claim 43 or 44, wherein: 

said B-VOP is encoded using at least one of: 

(a) a search region of said first base layer VOP whose center is determined according to said forward 
motion vector MV f ; and 

(b) a search region of said second base layer VOP whose center is determined according to said backward 
motion vector MV B . 